Exploratory / Confirmatory Factor Analysis?

Biostatistics Psychology/Sociology EFA Model Fit Statistics

Correlation matrix;
Exploratory Factor Analysis vs. Confirmatory Factor Analysis; Run An Example with Categorical Data in R (psych package);

Hai Nguyen
September 19, 2022

Why use factor analysis?

Factor analysis is a useful tool for investigating variable relationships for complex concepts such as socioeconomic status, dietary patterns, or psychological scales.

It allows researchers to investigate concepts that are not easily measured directly by collapsing a large number of variables into a few interpretable underlying factors.

What is a factor?

In factor analysis, a factor is a latent (unmeasured) variable that expresses itself through its relationships with measured variables.

Take, for example, a construct like cognition. We may want to measure a person’s cognition, but this is the kind of construct that would be impossible to capture with a single variable. It is too abstract and multifaceted, although it does represent a single concept. So instead, you may have to develop a scale with many items, each of which measures some more tangible aspect of cognition. The idea is that there is an underlying unmeasurable factor, cognition, that causes people to respond in certain patterns across the many items on the scale.

The purpose of factor analysis is to analyze these patterns of response as a way of getting at this underlying factor. Factor analysis also allows you to use the weighted item responses to create what are called factor scores. These represent a single score for each person on the factor. Factor scores are nice because they allow you to use a single variable as a measure of the factor in the other analyses, rather than a set of items.
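As a toy illustration (all numbers here are made up, not taken from any fitted model), a factor score is just a weighted combination of a person’s standardized item responses:

```r
# Toy sketch: a factor score as a weighted sum of standardized items.
# The weights are hypothetical; in practice they come from the fitted
# factor model (e.g., via the regression method).
set.seed(42)
items <- matrix(rnorm(50 * 3), nrow = 50, ncol = 3)  # 50 people, 3 items
Z <- scale(items)                                    # standardize each item
w <- c(0.5, 0.3, 0.4)                                # hypothetical weights
scores <- Z %*% w                                    # one score per person
head(scores)
```

Each person then carries a single number per factor, which can enter later analyses in place of the full set of items.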

Factor analysis is most often applied to continuous variables, but ordinal or nominal variables pose no problem when using fa() from the psych library.

\(\Rightarrow\) As demonstrated below, using ordinal or nominal/binary data for factor analysis in R is no more difficult than using continuous data for factor analysis in R.

If the data set contains a mix of variable types (continuous, ordinal, binary), use cor = "mixed" in fa() from the psych library.
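As a minimal, self-contained sketch (simulated ordinal items with assumed cut-points, not the NHANES data used below), the cor argument of fa() selects which correlation is computed before factoring:

```r
library(psych)

set.seed(1)
# Simulate 3 ordinal items (0-3) driven by one latent normal factor
f <- rnorm(500)
cuts <- c(-Inf, -0.5, 0.5, 1.5, Inf)      # assumed thresholds
x <- data.frame(
  a = cut(0.8 * f + rnorm(500), cuts, labels = FALSE) - 1,
  b = cut(0.8 * f + rnorm(500), cuts, labels = FALSE) - 1,
  c = cut(0.8 * f + rnorm(500), cuts, labels = FALSE) - 1
)
fit <- fa(x, nfactors = 1, cor = "poly")  # polychoric, for ordinal items
# cor = "tet"   -> tetrachoric, for binary items
# cor = "mixed" -> mixed continuous/polytomous/dichotomous variables
fit$loadings
```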

A case study of Depression data of ordinal variables

Now we jump into an example: the depression data (NHANES, 2014-15), which includes 10 ordinal variables. A concrete case helps us see the meaningfulness of FA.

The data file also includes a few more variables on different ordinal scales (the more critical/serious items), which are not used in this example.

The values of each variable share the same “ordinal” numeric format:

The higher the value \(\Rightarrow\) the more likely the respondent is to be depressed (roughly speaking).

Read raw data into R

options(width = 300)
dep <- read.csv("data/dep.csv",header = TRUE, 
                sep = ",", na.strings = "NA")
# na.strings : a character vector of strings which are to be 
#             interpreted as NA values.
head(dep)
   SEQN DPQ010 DPQ020 DPQ030 DPQ040 DPQ050 DPQ060 DPQ070 DPQ080 DPQ090 DPQ100 SLD010H SLQ050 SLQ060
1 73557      1      0      0      0      0      0      0      0      0      1       7      1      2
2 73558      2      0      0      0      0      0      0      0      0      0       9      2      2
3 73561      2      1      0      3      3      0      0      0      0      1       9      2      2
4 73562      3      3      3      3      3      1      2      1      0      3       5      2      1
5 73564      0      1      0      1      0      0      0      0      0      0       9      2      1
6 73566      0      0      1      0      0      0      0      0      0      0       6      1      2
nrow(dep)
[1] 3651
x <- dep[,2:11]

We first check our correlation matrix. Visual inspection of the correlation matrix can reveal similarities, or correlated groups of questions, in the data.

R <- cor(x)
library(corrplot)
corrplot(R,type="upper",order="hclust")

The blue structures are positively correlated groups, and the red structure represents a negatively correlated group that emerges in the data.

FA using fa() from psych

library(psych)
library(GPArotation)
head(x)
  DPQ010 DPQ020 DPQ030 DPQ040 DPQ050 DPQ060 DPQ070 DPQ080 DPQ090
1      1      0      0      0      0      0      0      0      0
2      2      0      0      0      0      0      0      0      0
3      2      1      0      3      3      0      0      0      0
4      3      3      3      3      3      1      2      1      0
5      0      1      0      1      0      0      0      0      0
6      0      0      1      0      0      0      0      0      0
  DPQ100
1      1
2      0
3      1
4      3
5      0
6      0

Important note on using psych’s fa() with different types of variables

fa(data, nfactors) has an additional argument, cor, with several options depending on the input variable types: cor = "cor" (Pearson, for continuous items), "poly" (polychoric, for ordinal items), "tet" (tetrachoric, for binary items), and "mixed" (for a mix of types).

Remarks

You may convert nominal variables into dummy variables, but that increases dimensionality very quickly.

The assumption for an ordinal variable is that it arises from an underlying latent normal variable.
Correlations between non-numeric variables are often related to chi-square test statistics.
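A small, self-contained sketch (simulated data, with an assumed latent correlation of 0.6) of why this matters: Pearson correlations of coarsely categorized variables are attenuated, while the polychoric estimate recovers the latent correlation under the latent-normal assumption:

```r
library(psych)

set.seed(123)
n  <- 2000
z1 <- rnorm(n)
z2 <- 0.6 * z1 + sqrt(1 - 0.6^2) * rnorm(n)  # latent correlation = 0.6
cuts <- c(-Inf, 0, 1, Inf)                    # assumed thresholds
o1 <- cut(z1, cuts, labels = FALSE)           # 3-category ordinal version
o2 <- cut(z2, cuts, labels = FALSE)
cor(o1, o2)                                   # Pearson: attenuated
polychoric(data.frame(o1, o2))$rho[1, 2]      # polychoric: close to 0.6
```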

Since our variables are all ordinal, let’s use polychoric correlations.

polychoric(x)
Call: polychoric(x = x)
Polychoric correlations 
       DPQ01 DPQ02 DPQ03 DPQ04 DPQ05 DPQ06 DPQ07 DPQ08 DPQ09 DPQ10
DPQ010 1.00                                                       
DPQ020 0.58  1.00                                                 
DPQ030 0.31  0.33  1.00                                           
DPQ040 0.38  0.38  0.37  1.00                                     
DPQ050 0.38  0.43  0.34  0.34  1.00                               
DPQ060 0.47  0.70  0.30  0.36  0.43  1.00                         
DPQ070 0.47  0.51  0.34  0.35  0.41  0.51  1.00                   
DPQ080 0.48  0.51  0.41  0.37  0.41  0.51  0.59  1.00             
DPQ090 0.40  0.65  0.34  0.37  0.37  0.70  0.49  0.49  1.00       
DPQ100 0.57  0.66  0.42  0.47  0.46  0.61  0.56  0.59  0.61  1.00 

 with tau of 
            1    2   3
DPQ010  0.296 1.06 1.5
DPQ020  0.368 1.22 1.6
DPQ030 -0.096 0.74 1.1
DPQ040 -0.665 0.68 1.1
DPQ050  0.333 1.07 1.5
DPQ060  0.660 1.36 1.7
DPQ070  0.635 1.26 1.6
DPQ080  0.980 1.52 1.9
DPQ090  1.646 2.12 2.4
DPQ100  0.629 1.60 2.1

The output contains the polychoric correlation matrix and the estimated thresholds (tau).

Now run fa() on the ordinal variables (each with more than two values), assuming two factors.

fa.out <- fa(dep[,2:11], nfactors=2, cor='poly', rotate = "oblimin")
fa.out
Factor Analysis using method =  minres
Call: fa(r = dep[, 2:11], nfactors = 2, rotate = "oblimin", cor = "poly")
Standardized loadings (pattern matrix) based upon correlation matrix
         MR2   MR1   h2   u2 com
DPQ010  0.54  0.15 0.44 0.56 1.1
DPQ020  0.19  0.67 0.69 0.31 1.2
DPQ030  0.69 -0.18 0.32 0.68 1.1
DPQ040  0.61 -0.06 0.32 0.68 1.0
DPQ050  0.54  0.05 0.34 0.66 1.0
DPQ060 -0.05  0.90 0.75 0.25 1.0
DPQ070  0.58  0.14 0.49 0.51 1.1
DPQ080  0.70  0.05 0.54 0.46 1.0
DPQ090  0.07  0.74 0.63 0.37 1.0
DPQ100  0.60  0.26 0.68 0.32 1.4

                       MR2  MR1
SS loadings           2.95 2.24
Proportion Var        0.29 0.22
Cumulative Var        0.29 0.52
Proportion Explained  0.57 0.43
Cumulative Proportion 0.57 1.00

 With factor correlations of 
     MR2  MR1
MR2 1.00 0.78
MR1 0.78 1.00

Mean item complexity =  1.1
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  45  and the objective function was  4.72 with Chi Square of  17223.44
The degrees of freedom for the model are 26  and the objective function was  0.14 

The root mean square of the residuals (RMSR) is  0.03 
The df corrected root mean square of the residuals is  0.04 

The harmonic number of observations is  3651 with the empirical chi square  246.89  with prob <  7.1e-38 
The total number of observations was  3651  with Likelihood Chi Square =  524.04  with prob <  3.7e-94 

Tucker Lewis Index of factoring reliability =  0.95
RMSEA index =  0.072  and the 90 % confidence intervals are  0.067 0.078
BIC =  310.76
Fit based upon off diagonal values = 1
Measures of factor score adequacy             
                                                   MR2  MR1
Correlation of (regression) scores with factors   0.93 0.94
Multiple R square of scores with factors          0.87 0.89
Minimum correlation of possible factor scores     0.74 0.78
names(fa.out)
 [1] "residual"      "dof"           "chi"           "nh"           
 [5] "rms"           "EPVAL"         "crms"          "EBIC"         
 [9] "ESABIC"        "fit"           "fit.off"       "sd"           
[13] "factors"       "complexity"    "n.obs"         "objective"    
[17] "criteria"      "STATISTIC"     "PVAL"          "Call"         
[21] "null.model"    "null.dof"      "null.chisq"    "TLI"          
[25] "RMSEA"         "BIC"           "SABIC"         "r.scores"     
[29] "R2"            "valid"         "score.cor"     "weights"      
[33] "rotation"      "communality"   "communalities" "uniquenesses" 
[37] "values"        "e.values"      "loadings"      "model"        
[41] "fm"            "rot.mat"       "Phi"           "Structure"    
[45] "method"        "scores"        "R2.scores"     "r"            
[49] "np.obs"        "fn"            "Vaccounted"   
plot(fa.out$values,type='b')

biplot(fa.out)

head(fa.out$scores)
            MR2        MR1
[1,] -0.3143623 -0.3168916
[2,] -0.5140691 -0.4382790
[3,]  0.7452267  0.2120120
[4,]  3.2561279  2.2674301
[5,] -0.6919326 -0.3763604
[6,] -0.7741730 -0.6857362

Let’s check factor scores of subject 122, who is really depressed!

x[122,]
    DPQ010 DPQ020 DPQ030 DPQ040 DPQ050 DPQ060 DPQ070 DPQ080 DPQ090
122      3      3      3      2      2      3      2      0      3
    DPQ100
122      3
fa.out$scores[122,]
     MR2      MR1 
5.193272 6.472891 

Really large numbers; look at the second score in particular.
Let’s get the predicted values of x[122,]; they should be high.

apply(x,2,mean) + fa.out$scores[122,]%*%t(as.matrix(fa.out$loadings[,1:2]))
       DPQ010   DPQ020   DPQ030   DPQ040   DPQ050   DPQ060   DPQ070
[1,] 4.350776 5.852346 3.323438 3.933419 3.722472 6.000532 4.341527
       DPQ080   DPQ090   DPQ100
[1,] 4.179323 5.194817 5.158928

Although the 2nd factor is not significant according to the elbow rule, we still include it in the analysis for its meaning.
Interpret the output as you would for numeric variables.
But care must be taken when generalizing FA to non-numeric variables.

Interpretation:

An alternative way of doing the same FA is to pass in the correlation matrix directly, but with less output (no scores):

poly.R <- polychoric(x)$rho
fa.out <- fa(r=poly.R, nfactors=2, n.obs = nrow(dep[,2:11]))

What are factor loadings?

fa.diagram(fa.out, digits = 2)

The relationship of each variable to the underlying factor is expressed by the so-called factor loading. Let’s go through each part of the printed output.

round(cbind(fa.out$loadings,"h2"=fa.out$communality,"u2"=fa.out$uniquenesses,"com"=fa.out$complexity),2)
         MR2   MR1   h2   u2  com
DPQ010  0.54  0.15 0.44 0.56 1.15
DPQ020  0.19  0.67 0.69 0.31 1.17
DPQ030  0.69 -0.18 0.32 0.68 1.14
DPQ040  0.61 -0.06 0.32 0.68 1.02
DPQ050  0.54  0.05 0.34 0.66 1.02
DPQ060 -0.05  0.90 0.75 0.25 1.01
DPQ070  0.58  0.14 0.49 0.51 1.11
DPQ080  0.70  0.05 0.54 0.46 1.01
DPQ090  0.07  0.74 0.63 0.37 1.02
DPQ100  0.60  0.26 0.68 0.32 1.37
fa.out$Vaccounted
                            MR2       MR1
SS loadings           2.9498769 2.2362545
Proportion Var        0.2949877 0.2236254
Cumulative Var        0.2949877 0.5186131
Proportion Explained  0.5688010 0.4311990
Cumulative Proportion 0.5688010 1.0000000

The variance accounted for portion of the output can be explained as follows:

The variable with the strongest association to the underlying latent variable Factor 1 is DPQ060 (feeling bad about yourself), with a factor loading of 0.90.

Since factor loadings can be interpreted like standardized regression coefficients, one could also say that the variable DPQ060 has a correlation of 0.90 with Factor 1. This would be considered a strong association for a factor analysis in most research fields.

Two other variables, DPQ090 (Thought you would be better off dead) and DPQ020 (Feeling down, depressed, or hopeless), are also associated with Factor 1. Based on the variables loading highly onto Factor 1, we could call it “Low self-esteem individual awareness.”

DPQ010, DPQ030, DPQ040, DPQ050, DPQ070, DPQ080, and DPQ100, however, have high factor loadings on Factor 2. They seem to indicate observed signs of depression, so we may want to call Factor 2 “Physical depression signs.”

fa.out$Phi
          MR2       MR1
MR2 1.0000000 0.7835837
MR1 0.7835837 1.0000000

Factor correlations
Whether you get this part of the output depends on whether factor correlations are estimated: you need multiple factors and a rotation that allows the factors to correlate.
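A self-contained sketch (simulated items from two factors with an assumed correlation of 0.5): an oblique rotation returns the factor correlation matrix Phi, while an orthogonal rotation does not estimate one:

```r
library(psych)
library(GPArotation)

set.seed(7)
# Simulate 6 continuous items from 2 correlated latent factors
f <- MASS::mvrnorm(500, mu = c(0, 0),
                   Sigma = matrix(c(1, 0.5, 0.5, 1), 2))
x <- data.frame(
  a1 = 0.8 * f[, 1] + rnorm(500, sd = 0.6),
  a2 = 0.8 * f[, 1] + rnorm(500, sd = 0.6),
  a3 = 0.8 * f[, 1] + rnorm(500, sd = 0.6),
  b1 = 0.8 * f[, 2] + rnorm(500, sd = 0.6),
  b2 = 0.8 * f[, 2] + rnorm(500, sd = 0.6),
  b3 = 0.8 * f[, 2] + rnorm(500, sd = 0.6)
)
obl <- fa(x, nfactors = 2, rotate = "oblimin")
ort <- fa(x, nfactors = 2, rotate = "varimax")
obl$Phi   # factor correlation matrix, entries near the true 0.5
ort$Phi   # NULL: varimax treats the factors as uncorrelated
```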

Model test results and fit indices (already discussed in the SEM posts)

Rotation in EFA

A feature of EFA is that the axes of the factors can be rotated within the multidimensional variable space. This is part of what a factor analysis program does while determining the best fit between the variables and the latent factors. Here, 10 variables go into the factor analysis.
The program first looks for the strongest correlations between the variables and a latent factor, and makes that Factor 1. Visually, one can think of it as an axis (Axis 1). The program then looks for the second set of correlations and calls it Factor 2, and so on. Sometimes the initial solution results in a variable correlating strongly with several factors, or in a variable that has no strong correlation with any factor.

fa.dem <- factor.rotate(fa.out,0,plot=TRUE,xlim=c(-1,1),ylim=c(-1,1), title="oblimin rotation")

Rotations that allow the factors to correlate are called oblique rotations; rotations that assume the factors are uncorrelated are called orthogonal rotations. We ran the example above using oblimin (oblique). Below, we test with varimax (orthogonal):

fa.out <- fa(dep[,2:11], nfactors=2, cor='poly', rotate = "varimax")
fa.out
Factor Analysis using method =  minres
Call: fa(r = dep[, 2:11], nfactors = 2, rotate = "varimax", cor = "poly")
Standardized loadings (pattern matrix) based upon correlation matrix
        MR2  MR1   h2   u2 com
DPQ010 0.40 0.53 0.44 0.56 1.9
DPQ020 0.71 0.43 0.69 0.31 1.6
DPQ030 0.17 0.54 0.32 0.68 1.2
DPQ040 0.24 0.52 0.32 0.68 1.4
DPQ050 0.31 0.49 0.34 0.66 1.7
DPQ060 0.81 0.31 0.75 0.25 1.3
DPQ070 0.41 0.56 0.49 0.51 1.8
DPQ080 0.38 0.63 0.54 0.46 1.6
DPQ090 0.71 0.34 0.63 0.37 1.4
DPQ100 0.53 0.63 0.68 0.32 1.9

                       MR2  MR1
SS loadings           2.60 2.58
Proportion Var        0.26 0.26
Cumulative Var        0.26 0.52
Proportion Explained  0.50 0.50
Cumulative Proportion 0.50 1.00

Mean item complexity =  1.6
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  45  and the objective function was  4.72 with Chi Square of  17223.44
The degrees of freedom for the model are 26  and the objective function was  0.14 

The root mean square of the residuals (RMSR) is  0.03 
The df corrected root mean square of the residuals is  0.04 

The harmonic number of observations is  3651 with the empirical chi square  246.89  with prob <  7.1e-38 
The total number of observations was  3651  with Likelihood Chi Square =  524.04  with prob <  3.7e-94 

Tucker Lewis Index of factoring reliability =  0.95
RMSEA index =  0.072  and the 90 % confidence intervals are  0.067 0.078
BIC =  310.76
Fit based upon off diagonal values = 1
Measures of factor score adequacy             
                                                   MR2  MR1
Correlation of (regression) scores with factors   0.87 0.82
Multiple R square of scores with factors          0.75 0.68
Minimum correlation of possible factor scores     0.51 0.35

Sometimes the orthogonal rotation does not work out, i.e., variables fail to load highly onto a single factor.

op <- par(mfrow=c(1,2))
cluster.plot(fa.out,xlim=c(-1,1),ylim=c(-1,1),title="Unrotated ")
f2r <- factor.rotate(fa.out,-65,plot=TRUE,xlim=c(-1,1),ylim=c(-1,1),title="rotated -65 degrees")

Comparison of EFA and CFA

What gets analyzed
EFA: Correlation matrix (of items = indicators)
➢ Only correlations among observed item responses are used
➢ Only a standardized solution is provided, so the original means and variances of the observed item responses are irrelevant
CFA: Covariance matrix (of items = indicators)
➢ Means, variances, and covariances of item responses are analyzed
➢ Item response means historically have been ignored
➢ Output includes unstandardized AND standardized solutions

Interpretation
EFA: Rotation defines interpretation
➢ All items load on all factors (latent traits)
➢ Goal is to pick the rotation that gives the closest approximation to “simple structure”
➢ No way of distinguishing latent variables due to traits being measured from correlation
CFA: Theory-driven: any structure becomes a testable hypothesis
➢ You specify the number of latent factors and their inter-correlations
➢ You specify which items load on which latent factors
➢ You specify any additional relationships for method/other covariance
➢ You just need a clue; you don’t have to be right (misfit is informative)

Model fit
EFA: Eyeballs and opinion
➢ How many factors? Scree plots, interpretability
➢ Which rotation?
➢ Which items load on each factor? Arbitrary cutoff of .3-.4-ish
➢ Standard errors infrequently used
CFA: Inferential tests via ML
➢ Global model fit test (and local fit)
➢ Standard errors (and significance tests) of item loadings, error variances, and error covariances (all parameters)
➢ Ability to test the appropriateness of model constraints or model additions via tests for change in model fit

Factor scores
EFA: Don’t ever use factor scores from an EFA
➢ Factor scores are indeterminate (especially due to rotation)
➢ Inconsistency in how factor models are applied to data
➢ Best option: test relations among latent factors directly through SEM
CFA: Factor scores can be used, but only if necessary
➢ Factor scores are less indeterminate in CFA, and could be used
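To make the CFA column concrete, here is a self-contained sketch using the lavaan package (simulated ordinal items; the factor structure, item names, and estimator choice are illustrative assumptions, not results from this post). Unlike EFA, we specify up front which items load on which factor:

```r
library(lavaan)

set.seed(2022)
n <- 800
# Simulate 6 ordinal items from 2 correlated latent factors
f <- MASS::mvrnorm(n, mu = c(0, 0),
                   Sigma = matrix(c(1, 0.6, 0.6, 1), 2))
ordinalize <- function(z) cut(z, c(-Inf, -0.5, 0.5, Inf), labels = FALSE)
d <- data.frame(
  y1 = ordinalize(0.8 * f[, 1] + rnorm(n, sd = 0.6)),
  y2 = ordinalize(0.8 * f[, 1] + rnorm(n, sd = 0.6)),
  y3 = ordinalize(0.8 * f[, 1] + rnorm(n, sd = 0.6)),
  y4 = ordinalize(0.8 * f[, 2] + rnorm(n, sd = 0.6)),
  y5 = ordinalize(0.8 * f[, 2] + rnorm(n, sd = 0.6)),
  y6 = ordinalize(0.8 * f[, 2] + rnorm(n, sd = 0.6))
)
model <- '
  F1 =~ y1 + y2 + y3
  F2 =~ y4 + y5 + y6
'
fit <- cfa(model, data = d,
           ordered = paste0("y", 1:6),   # declare items as ordinal
           estimator = "WLSMV")          # a common choice for ordinal items
fitMeasures(fit, c("cfi", "tli", "rmsea"))
```

Every loading left out of the model string is fixed to zero; that restriction is what turns the structure into a testable hypothesis, and the fit indices then quantify the resulting misfit.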

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hai-mn/hai-mn.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Nguyen (2022, Sept. 19). HaiBiostat: Exploratory / Confirmatory Factor Analysis?. Retrieved from https://hai-mn.github.io/posts/2022-09-15-Factor Analysis/

BibTeX citation

@misc{nguyen2022exploratory,
  author = {Nguyen, Hai},
  title = {HaiBiostat: Exploratory / Confirmatory Factor Analysis?},
  url = {https://hai-mn.github.io/posts/2022-09-15-Factor Analysis/},
  year = {2022}
}